1 Predictive techniques for aggressive load speculation
Glenn Reinman, Brad Calder 
November 1998 Proceedings of the 31st annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf(1.94 MB) Additional Information: full citation, references, citings, index terms



2 Predictor-directed stream buffers 

Timothy Sherwood, Suleyman Sair, Brad Calder 

December 2000 Proceedings of the 33rd annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf(187.89 KB)
Microarchitectures: Scaling the issue window with look-ahead latency prediction
Yongxiang Liu, Anahita Shayesteh, Gokhan Memik, Glenn Reinman 
June 2004 Proceedings of the 18th annual international conference on 
Supercomputing 

Full text available: pdf(302.66 KB) Additional Information: full citation, abstract, references, index terms

In contemporary out-of-order superscalar design, high IPC is mainly achieved by exposing 
high instruction level parallelism (ILP). Scaling issue window size can certainly provide more 
ILP; however, future processor scaling demands threaten to limit the size of the issue 
window. In this study, we propose a dynamic instruction sorting mechanism that provides 
more ILP without increasing the size of the issue window. In our approach, early in the 
pipeline, we predict how long an instruction needs to ... 

Keywords: CLP, LHT, MNM, SILO, instruction sorting 
Microarchitecture support for improving the performance of load target prediction
Chung-Ho Chen, Akida Wu 

December 1997 Proceedings of the 30th annual ACM/IEEE international symposium on 
Microarchitecture 
Presents a load target prediction scheme that mitigates the impact of load latency for
modern microprocessors. The scheme uses a cache-like buffer to provide the base address, 
offset and operand size at the instruction fetching stage of a pipeline so that a load target 
address can be computed earlier at the decode stage. With the dynamic use of a load stride, 
the scheme has achieved a prediction rate that is 15% higher than a previously proposed 
approach. By providing a 128-entry direct-mapped I ... 

Keywords: load target prediction, load-use stall, pipeline, speculative data access, 
superscalar processor



5 Examination of a memory access classification scheme for pointer-intensive and 
numeric programs 
Luddy Harrison 

January 1996 Proceedings of the 10th international conference on Supercomputing 
6 Architecture: Continual flow pipelines

Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton 
October 2004 Proceedings of the 11th international conference on Architectural 
support for programming languages and operating systems 

Full text available: pdf(274.26 KB) Additional Information: full citation, abstract, references, index terms

Increased integration in the form of multiple processor cores on a single die, relatively 
constant die sizes, shrinking power envelopes, and emerging applications create a new 
challenge for processor architects. How to build a processor that provides high single-thread 
performance and enables multiple of these to be placed on the same die for high throughput 
while dynamically adapting for future applications? Conventional approaches for high single- 
thread performance rely on large and complex co ... 

Keywords: CFP, instruction window, latency tolerance, non-blocking 
7 Speculation techniques for improving load related instruction scheduling
Adi Yoaz, Mattan Erez, Ronny Ronen, Stephan Jourdan 

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th
annual international symposium on Computer architecture, volume 27 issue 2
Full text available: pdf(164.15 KB) Additional Information: full citation, abstract, references, citings, index terms
State of the art microprocessors achieve high performance by executing multiple
instructions per cycle. In an out-of-order engine, the instruction scheduler is responsible for 
dispatching instructions to execution units based on dependencies, latencies, and resource 
availability. Most existing instruction schedulers are doing a less than optimal job of 
scheduling memory accesses and instructions dependent on them, for the following 
reasons: • Memory dependencies cannot be resolved prior ...
8 Value prediction in VLIW machines
Tarun Nakra, Rajiv Gupta, Mary Lou Soffa 

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th
annual international symposium on Computer architecture, volume 27 issue 2
The performance of VLIW architectures is dependent on the capability of the compiler to
detect and exploit instruction-level parallelism during instruction scheduling. To exploit the 
detected parallelism, instructions are reordered to reduce the length of the code schedule 
and minimize the cycle count for execution. Code reordering is limited by the dependencies 
among instructions arising from both control flow and data flow. In this paper, we present 
the design of a VLIW architecture that uses ... 



9 Clustered microarchitectures: Cluster prefetch: tolerating on-chip wire delays in
clustered microarchitectures
Rajeev Balasubramonian 

June 2004 Proceedings of the 18th annual international conference on 
Supercomputing 

Full text available: pdf(153.44 KB) Additional Information: full citation, abstract, references, index terms

The growing dominance of wire delays at future technology points renders a microprocessor 
communication-bound. Clustered microarchitectures allow most dependence chains to 
execute without being affected by long on-chip wire latencies. They also allow faster clock 
speeds and reduce design complexity, thereby emerging as a popular design choice for 
future microprocessors. However, a centralized data cache threatens to be the primary 
bottle-neck in highly clustered systems. The paper attempts to id ... 

Keywords: clustered microarchitectures, communication-bound processors, data prefetch, 
distributed caches, effective address and memory dependence prediction, processor 



10 Detecting global stride locality in value streams
Huiyang Zhou, Jill Flanagan, Thomas M. Conte 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2
Full text available: pdf(292.41 KB) Additional Information: full citation, abstract, references, citings

Value prediction exploits localities in value streams. Previous research focused on exploiting 
two types of value localities, computational and context-based, in the local value history, 
which is the value sequence produced by the same instruction that is being predicted. 
Besides the local value history, value locality also exists in the global value history, which is 
the value sequence produced by all dynamic instructions according to their execution order. 
In this paper, a new type value local ... 

11 Correlated load-address predictors 

Michael Bekerman, Stephan Jourdan, Ronny Ronen, Gilad Kirshenboim, Lihu Rappoport, Adi 
Yoaz, Uri Weiser 

May 1999 ACM SIGARCH Computer Architecture News, Proceedings of the 26th
annual international symposium on Computer architecture, volume 27 issue 2
Full text available: pdf(149.18 KB) Additional Information: full citation, abstract, references, citings, index terms
As microprocessors become faster, the relative performance cost of memory accesses
increases. Bigger and faster caches significantly reduce the absolute load-to-use time delay. 
However, increase in processor operational frequencies impairs the relative load-to-use
latency, measured in processor cycles (e.g. from two cycles on the Pentium® processor
to three cycles or more in current designs). Load-address prediction techniques were 
introduced to partially cut the load-to-use latency. Thi ... 

Keywords: context-based predictor, global correlation, load-address prediction, predictor 
implementation, recursive data structures 



12 Enhancing software reliability with speculative threads 
Jeffrey Oplinger, Monica S. Lam 

October 2002 Proceedings of the 10th international conference on Architectural
support for programming languages and operating systems, volume 37, 36, 30 Issue 10, 5, 5

Full text available: pdf(1.47 MB) Additional Information: full citation, abstract, references, citings

This paper advocates the use of a monitor-and-recover programming paradigm to enhance 
the reliability of software, and proposes an architectural design that allows software and 
hardware to cooperate in making this paradigm more efficient and easier to program. We 
propose that programmers write monitoring functions assuming simple sequential execution 
semantics. Our architecture speeds up the computation by executing the monitoring 
functions speculatively in parallel with the main computation. For ... 